Abstract

In this paper we consider the task of esti- 
mating the non-zero pattern of the sparse in- 
verse covariance matrix of a zero-mean Gaus- 
sian random vector from a set of iid samples. 
Note that this is also equivalent to recover- 
ing the underlying graph structure of a sparse 
Gaussian Markov Random Field (GMRF). 
We present two novel greedy approaches to 
solving this problem. The first estimates the 
non-zero covariates of the overall inverse co- 
variance matrix using a series of global for- 
ward and backward greedy steps. The sec- 
ond estimates the neighborhood of each node 
in the graph separately, again using greedy 
forward and backward steps, and combines 
the intermediate neighborhoods to form an 
overall estimate. The principal contribu- 
tion of this paper is a rigorous analysis of 
the sparsistency, or consistency in recover- 
ing the sparsity pattern of the inverse co- 
variance matrix. Surprisingly, we show that 
both the local and global greedy methods 
learn the full structure of the model with 
high probability given just $O(d \log(p))$ samples,
which is a significant improvement over the
state-of-the-art $\ell_1$-regularized Gaussian MLE
(Graphical Lasso), which requires $O(d^2 \log(p))$
samples. Moreover, the restricted eigenvalue
and smoothness conditions imposed by our
greedy methods are much weaker than the
strong irrepresentable conditions required by
the $\ell_1$-regularization based methods. We
corroborate our results with extensive simulations
and examples, comparing our local and
global greedy methods to the $\ell_1$-regularized
Gaussian MLE, as well as the Neighborhood
Greedy method to nodewise $\ell_1$-regularized
linear regression (Neighborhood Lasso).



1 Introduction 

High-dimensional Covariance Estimation. Increasingly,
modern statistical problems across varied fields
of science and engineering involve a large number of 
variables. Estimation of such high-dimensional mod- 
els has been the focus of considerable recent research, 
and it is now well understood that consistent estima- 
tion is possible when some low-dimensional structure 
is imposed on the model space. In this paper, we con- 
sider the specific high-dimensional problem of recov- 
ering the covariance matrix of a zero-mean Gaussian 
random vector, under the low-dimensional structural 
constraint of sparsity of the inverse covariance, or con- 
centration matrix. When the random vector is mul- 
tivariate Gaussian, the set of non-zero entries in the 
concentration matrix corresponds to the set of edges in
an associated Gaussian Markov random field (GMRF). 
In this setting, imposing sparsity on the entries of the 
concentration matrix can be interpreted as requiring 
that the graph underlying the GMRF have relatively 
few edges. 

State of the art: $\ell_1$ regularized Gaussian MLE. For
this task of sparse GMRF estimation, a line of recent
papers [3, 5, 15] has proposed an estimator that minimizes
the Gaussian negative log-likelihood regularized
by the $\ell_1$ norm of the entries (or the off-diagonal
entries) of the concentration matrix. The resulting
optimization problem is a log-determinant program,
which can be solved in polynomial time with interior
point methods [1], or by coordinate descent
algorithms [3, 5]. Rothman et al. [11] and Ravikumar et al.
[10] have also shown strong statistical guarantees for
this estimator, both in $\ell_2$ operator norm error bounds
and in recovery of the underlying graph structure.

Recent resurgence of greedy methods. A related line of
recent work on learning sparse models has focused on 
"stagewise" greedy algorithms. These perform simple 
forward steps (adding parameters greedily), and possi- 
bly also backward steps (removing parameters greed- 
ily), and yet provide strong statistical guarantees for 
the estimate after a finite number of greedy steps. In- 
deed, such greedy algorithms have appeared in various 
guises in multiple communities: in machine learning 
as boosting [4], in function approximation [13], and in
signal processing as basis pursuit [2]. In the context of
statistical model estimation, Zhang [17] analyzed the 
forward greedy algorithm for the case of sparse lin- 
ear regression; and showed that the forward greedy 
algorithm is sparsistent (consistent for model selec- 
tion recovery) under the same "irrepresentable" con- 
dition as that required for "sparsistency" of the Lasso. 
Zhang [16] analyzed a more general greedy algorithm
for sparse linear regression that performs forward and
backward steps, and showed that it is sparsistent under
a weaker restricted eigenvalue condition. Jalali
et al. [7] extend the sparsistency analysis of [16] to 
general non-linear models, and again show that strong 
sparsistency guarantees hold for these algorithms. 

Our Approaches. Motivated by these recent results, we
apply the forward-backward greedy algorithm studied
in [16, 7] to the task of learning the graph structure of
a Gaussian Markov random field given iid samples. We
propose two algorithms: one that applies the greedy 
algorithm to the overall Gaussian log-likelihood loss, 
and the other that is based on greedy neighborhood 
estimation. For this second method, we follow [3, 9],
and estimate the neighborhood of each node by ap- 
plying the greedy algorithm to the local node condi- 
tional log-likelihood loss (which reduces to the least 
squares loss), and then show that each neighborhood 
is recovered with very high probability, so that by an 
elementary union bound, the entire graph structure 
is recovered with high probability. A principal con- 
tribution of this paper is a rigorous analysis of these 
algorithms, where we report sufficient conditions for 
recovery of the underlying graph structure. We also 
corroborate our analysis with extensive simulations. 

Our analysis shows that for a Gaussian random vector
$X = (X_1, X_2, \ldots, X_p)$ with $p$ variables, both
the global and local greedy algorithms only require
$n = O(d \log(p))$ samples for sparsistent graph recovery.
Note that this is a significant improvement over
the $\ell_1$ regularized Gaussian MLE [15], which has been
shown to require $O(d^2 \log(p))$ samples [10]. Moreover,
we show that the local and global greedy algorithms require
only a very weak restricted eigenvalue and restricted
smoothness condition on the true inverse covariance
matrix (with the local greedy imposing a marginally
weaker condition than the global greedy algorithm).
This is in contrast to the $\ell_1$ regularized Gaussian
MLE, which imposes a very stringent edge-based
irrepresentable condition [10]. In Section 5 we explicitly
compare the conditions imposed by the
various methods for some simple GMRFs, and
show quantitatively that the local and global greedy
methods place much weaker requirements on the
covariance entries. Thus, both theoretically
and via simulations, we show that the
methods proposed in this paper are the state of the art
in recovering the graph structure of a GMRF from iid
samples, both in the number of samples required and in
the weakness of the sufficient conditions imposed upon
the model.

2 Problem Setup 

2.1 Gaussian graphical models 

Let $X = (X_1, X_2, \ldots, X_p)$ be a zero-mean Gaussian
random vector. Its density is parameterized by
its inverse covariance or concentration matrix
$\Theta^* = (\Sigma^*)^{-1} \succ 0$, and can be written as

$$f(x_1, \ldots, x_p; \Theta^*) = \frac{\exp\{-\frac{1}{2} x^T \Theta^* x\}}{\sqrt{(2\pi)^p \det((\Theta^*)^{-1})}}. \qquad (1)$$

We can associate an undirected graph structure $G = (V, E)$
with this distribution, with the vertex set
$V = \{1, 2, \ldots, p\}$ corresponding to the variables
$(X_1, \ldots, X_p)$, and with edge set such that $(i, j) \notin E$ if
$\Theta^*_{ij} = 0$.

We are interested in the problem of recovering this
underlying graph structure, which corresponds to determining
which off-diagonal entries of $\Theta^*$ are non-zero,
that is, the set

$$E(\Theta^*) := \{ i, j \in V \mid i \neq j, \ \Theta^*_{ij} \neq 0 \}. \qquad (2)$$

Given $n$ samples, we define the sample covariance matrix

$$\hat{\Sigma}^n := \frac{1}{n} \sum_{k=1}^{n} X^{(k)} (X^{(k)})^T. \qquad (3)$$

In the sequel, we occasionally drop the superscript $n$,
and simply write $\hat{\Sigma}$ for the sample covariance.

With a slight abuse of notation, we define the sparsity
index $s := |E(\Theta^*)|$ as the total number of non-zero
elements in off-diagonal positions of $\Theta^*$; equivalently,
this corresponds to twice the number of edges in the
case of a Gaussian graphical model. We also define the
maximum degree or row cardinality

$$d := \max_{i = 1, \ldots, p} \left| \{ j \in V \mid \Theta^*_{ij} \neq 0 \} \right|, \qquad (4)$$

corresponding to the maximum number of non-zeros in
any row of $\Theta^*$; this corresponds to the maximum degree
in the graph of the underlying Gaussian graphical
model. Note that we have included the diagonal entry
$\Theta^*_{ii}$ in the degree count, corresponding to a self-loop
at each vertex.
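
As a small illustration of these definitions, the following sketch computes the edge set of Eq. (2), the sparsity index $s$, and the maximum row cardinality $d$ of Eq. (4) for a toy precision matrix. The function name `graph_summary`, the tolerance `tol`, and the 3-node chain are assumptions made for this example only.

```python
def graph_summary(theta, tol=1e-12):
    """Return (edges, s, d) for a symmetric precision matrix given as nested lists."""
    p = len(theta)
    # Ordered pairs (i, j), i != j, with a non-zero entry: the set E(Theta*) of Eq. (2).
    edges = {(i, j) for i in range(p) for j in range(p)
             if i != j and abs(theta[i][j]) > tol}
    s = len(edges)  # sparsity index: twice the number of undirected edges
    # Maximum row cardinality; the diagonal entry is counted, as in Eq. (4).
    d = max(sum(1 for j in range(p) if abs(theta[i][j]) > tol) for i in range(p))
    return edges, s, d


# Hypothetical toy example: a chain on three nodes, 0 - 1 - 2.
theta_chain = [[1.0, 0.4, 0.0],
               [0.4, 1.0, 0.4],
               [0.0, 0.4, 1.0]]
edges, s, d = graph_summary(theta_chain)
print(sorted(edges), s, d)
```

Here the middle node has two neighbors plus its own diagonal entry, which is why the row cardinality counts three entries.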

2.2 State of the art: $\ell_1$ regularization

Define the off-diagonal $\ell_1$ regularizer

$$\|\Theta\|_{1,\mathrm{off}} := \sum_{i \neq j} |\Theta_{ij}|, \qquad (5)$$

where the sum ranges over all $i, j = 1, \ldots, p$ with
$i \neq j$. Given some regularization constant $\lambda_n > 0$,
we consider estimating $\Theta^*$ by solving the following
$\ell_1$-regularized log-determinant program:

$$\hat{\Theta} := \arg\min_{\Theta \succ 0} \left\{ \langle\langle \Theta, \hat{\Sigma}^n \rangle\rangle - \log\det(\Theta) + \lambda_n \|\Theta\|_{1,\mathrm{off}} \right\},$$

which returns a symmetric positive definite matrix $\hat{\Theta}$.
Note that this corresponds to the $\ell_1$ regularized Gaussian
MLE when the underlying distribution is Gaussian.
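
To make the objective concrete, here is a minimal sketch that only evaluates the $\ell_1$-regularized log-determinant objective at a given matrix; it does not solve the program. The helper names `log_det_spd` and `l1_logdet_objective` are hypothetical, and the pure-Python Cholesky routine is a convenience chosen for this example.

```python
import math


def log_det_spd(theta):
    """log det of a symmetric positive definite matrix via Cholesky (raises if not PD)."""
    p = len(theta)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = theta[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(acc)
            else:
                chol[i][j] = acc / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))


def l1_logdet_objective(theta, sigma_hat, lam):
    """<<Theta, Sigma_hat>> - log det(Theta) + lam * ||Theta||_{1,off}."""
    p = len(theta)
    inner = sum(theta[i][j] * sigma_hat[i][j] for i in range(p) for j in range(p))
    l1_off = sum(abs(theta[i][j]) for i in range(p) for j in range(p) if i != j)
    return inner - log_det_spd(theta) + lam * l1_off


identity = [[1.0, 0.0], [0.0, 1.0]]
value = l1_logdet_objective([[2.0, 1.0], [1.0, 2.0]], identity, lam=0.5)
print(value)
```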

2.3 Forward Backward Greedy 

[16, 7] consider a simple forward-backward greedy algorithm
for model estimation that begins with an
empty set of active variables and gradually adds (and
removes) variables to the active set. This algorithm
has two basic steps: the forward step and the backward
step. In the forward step, the algorithm finds
the best next candidate and adds it to the active set
as long as it improves the loss function by at least $\epsilon_S$;
otherwise the stopping criterion is met and the algorithm
terminates. Then, in the backward step, the
algorithm checks the influence of all variables in the
presence of the newly added variable. If one or more
of the previously added variables do not contribute
at least $\nu\epsilon_S$ to the loss function, then the algorithm
removes them from the active set. This procedure ensures
that at each round, the loss function is improved
by at least $(1 - \nu)\epsilon_S$, and hence it terminates within a
finite number of steps.

In the sequel, we will apply this greedy method- 
ology to Gaussian graphical models, to obtain two 
methods: (a) Greedy Gaussian MLE, which applies 
the greedy algorithm to the Gaussian negative log- 
likelihood loss, and (b) Greedy Neighborhood Estima- 
tion, which applies the greedy algorithm to the local 
node-conditional negative log-likelihood loss. 



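Algorithm 1 Global greedy forward-backward algorithm
for Gaussian covariance estimation

Input: $\hat{\Sigma}^n$, Stopping Threshold $\epsilon_S$, Backward Step Factor $\nu \in (0, 1)$
Output: Inverse Covariance Estimation $\hat{\Theta}$

Initialize $\hat{\Theta}^{(0)} \leftarrow I$, $\hat{S}^{(0)} \leftarrow \emptyset$, and $k \leftarrow 1$
while true do {Forward Step}
    $((i_*, j_*), \alpha_*) \leftarrow \arg\min_{(i,j) \in (\hat{S}^{(k-1)})^c;\ \alpha} \mathcal{L}(\hat{\Theta}^{(k-1)} + \alpha(e_{ij} + e_{ji}))$
    $\hat{S}^{(k)} \leftarrow \hat{S}^{(k-1)} \cup \{(i_*, j_*)\}$
    $\delta_f^{(k)} \leftarrow \mathcal{L}(\hat{\Theta}^{(k-1)}) - \mathcal{L}(\hat{\Theta}^{(k-1)} + \alpha_*(e_{i_* j_*} + e_{j_* i_*}))$
    if $\delta_f^{(k)} < \epsilon_S$ then
        break
    end if
    $\hat{\Theta}^{(k)} \leftarrow \arg\min_{\Theta} \mathcal{L}(\Theta_{\hat{S}^{(k)}})$
    $k \leftarrow k + 1$
    while true do {Backward Step}
        $(i_*, j_*) \leftarrow \arg\min_{(i,j) \in \hat{S}^{(k-1)}} \mathcal{L}(\hat{\Theta}^{(k-1)} - \hat{\Theta}^{(k-1)}_{ij}(e_{ij} + e_{ji}))$
        if $\mathcal{L}(\hat{\Theta}^{(k-1)} - \hat{\Theta}^{(k-1)}_{i_* j_*}(e_{i_* j_*} + e_{j_* i_*})) - \mathcal{L}(\hat{\Theta}^{(k-1)}) > \nu \delta_f^{(k)}$ then
            break
        end if
        $\hat{S}^{(k-1)} \leftarrow \hat{S}^{(k-1)} - \{(i_*, j_*)\}$
        $\hat{\Theta}^{(k-1)} \leftarrow \arg\min_{\Theta} \mathcal{L}(\Theta_{\hat{S}^{(k-1)}})$
        $k \leftarrow k - 1$
    end while
end while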



3 Greedy Gaussian MLE 

In Algorithm 1 we describe the greedy algorithm of
[16, 7] as applied to the Gaussian log-likelihood loss,

$$\mathcal{L}(\Theta) := \langle\langle \Theta, \hat{\Sigma}^n \rangle\rangle - \log\det(\Theta).$$

Assumption:

Let $\rho > 1$ be a constant and $\Delta \in \mathbb{R}^{p \times p}$ be a symmetric
matrix that is sparse with at most $\eta d$ non-zero
entries per row (and column) for some
$\eta > 2 + 4\rho^2 \left( \sqrt{(\rho^2 - \rho)/d} + \sqrt{2} \right)^2$. We require that the
population covariance matrix $\Sigma^* = \mathbb{E}[X X^T]$ satisfy the
restricted eigenvalue property, i.e., for some positive
constant $C_{\min}$, we have

$$C_{\min} \|\Delta\|_F \le \langle\langle \Sigma^*, \Delta \rangle\rangle \le \rho C_{\min} \|\Delta\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

Lemma 1. Suppose $\Sigma^*$ satisfies the assumption above.
Then, with probability at least $1 - c_1 \exp(-c_2 n)$, for an
arbitrarily small constant $a > 0$, we have that for any
symmetric matrix $\Delta$ with $\eta d$ non-zero entries per row
(and column),

$$(1 - a) C_{\min} \|\Delta\|_F \le \langle\langle \hat{\Sigma}^n, \Delta \rangle\rangle \le (1 + a) \rho C_{\min} \|\Delta\|_F,$$

provided that $n > K d \log(p)$ for some positive constants
$K$, $c_1$ and $c_2$.

Proof. The proof follows from Lemma 9 (Appendix K)
in [11].

□

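Using the Taylor series expansion, we can write

$$\mathcal{L}(\Theta + \Delta) = \mathcal{L}(\Theta) + \langle\langle \Delta, \hat{\Sigma}^n \rangle\rangle - \langle\langle \Theta^{-1}, \Delta \rangle\rangle + \underbrace{\sum_{i=2}^{\infty} \frac{(-1)^i}{i} \langle\langle (\Theta^{-1}\Delta)^{i-1} \Theta^{-1}, \Delta \rangle\rangle}_{R_\Delta}.$$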
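
The expansion can be checked numerically. The sketch below assumes NumPy, uses the identity $\langle\langle (\Theta^{-1}\Delta)^{i-1}\Theta^{-1}, \Delta \rangle\rangle = \mathrm{tr}((\Theta^{-1}\Delta)^i)$ for symmetric matrices, and compares both sides of the log-determinant part (the linear term in $\hat{\Sigma}^n$ is identical on both sides and is omitted). The matrices and the helper name `neg_logdet_remainder` are hypothetical; the truncated series only converges when the spectral radius of $\Theta^{-1}\Delta$ is below one.

```python
import numpy as np


def neg_logdet_remainder(theta, delta, terms=200):
    """Truncated series R_Delta = sum_{i>=2} (-1)^i / i * tr((Theta^{-1} Delta)^i)."""
    m = np.linalg.solve(theta, delta)      # Theta^{-1} Delta
    power = m.copy()
    total = 0.0
    for i in range(2, terms + 1):
        power = power @ m                  # now equals (Theta^{-1} Delta)^i
        total += (-1) ** i / i * np.trace(power)
    return total


theta = np.array([[2.0, 0.3], [0.3, 1.5]])
delta = np.array([[0.1, 0.2], [0.2, -0.1]])

# Left side: -log det(Theta + Delta), computed directly.
lhs = -np.linalg.slogdet(theta + delta)[1]
# Right side: -log det(Theta) - <<Theta^{-1}, Delta>> + R_Delta.
rhs = (-np.linalg.slogdet(theta)[1]
       - np.trace(np.linalg.solve(theta, delta))
       + neg_logdet_remainder(theta, delta))
print(lhs, rhs)
```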

In order to establish the restricted strong convexity/smoothness
required by [7], we need to
lower/upper bound $R_\Delta$. Notice that in the proof of
[7], the required $\Delta$ is the difference between the target
variable $\Theta^*$ and the $k$-th step estimate $\hat{\Theta}^{(k)}$. Since the
algorithm is guaranteed to converge, $\Delta = \Theta^* - \hat{\Theta}^{(k)}$ is
always bounded. Thus, without loss of generality, we
assume that $\|\Delta\|_F \le 1$. Notice that we can rescale $\|\Delta\|_F$
and a similar result holds. The next lemma provides
the required upper/lower bound.

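Lemma 2. Suppose $\Sigma^*$ satisfies the assumption above.
Then with probability at least $1 - c_1 \exp(-c_2 n)$, we have
that for any symmetric matrix $\Delta$ with $\eta d$ non-zero entries
per row (and column), and with $\|\Delta\|_F \le 1$,

$$\frac{1}{4} C_{\min}^2 \|\Delta\|_F^2 \le R_\Delta \le \frac{1}{2} \rho^2 C_{\min}^2 \|\Delta\|_F^2.$$

Proof. Denote $\gamma = \langle\langle \Theta^{-1}, \Delta \rangle\rangle$. We have
$R_\Delta = \gamma - \log(1 + \gamma)$.
Under our assumption, $C_{\min}\|\Delta\|_F \le \gamma \le \rho C_{\min}\|\Delta\|_F$,
and the function $\gamma - \log(1 + \gamma)$ is an increasing function
in $\gamma$. Moreover, for the range of $\gamma$, we have
$\gamma - \log(1 + \gamma) \ge \frac{1}{4}\gamma^2$, because both sides vanish at zero and the
derivative of the LHS is larger than the derivative of the RHS.
Hence, we have

$$\frac{1}{4} C_{\min}^2 \|\Delta\|_F^2 \le C_{\min}\|\Delta\|_F - \log(1 + C_{\min}\|\Delta\|_F)$$
$$\le \gamma - \log(1 + \gamma) = R_\Delta$$
$$\le \rho C_{\min}\|\Delta\|_F - \log(1 + \rho C_{\min}\|\Delta\|_F) \le \frac{1}{2} \rho^2 C_{\min}^2 \|\Delta\|_F^2.$$

The last inequality follows from $\gamma - \log(1 + \gamma) \le \frac{1}{2}\gamma^2$
(since they are equal at zero and the derivative of the RHS
is always larger above zero). Hence, the result follows.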

□ 
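
The two scalar inequalities used in the proof can be sanity-checked on a grid. The following sketch is only a numerical check on $\gamma \in (0, 1]$, not a proof; the grid resolution and the helper name `remainder` are arbitrary choices for this example.

```python
import math


def remainder(gamma):
    """gamma - log(1 + gamma), the scalar form of R_Delta used in the proof."""
    return gamma - math.log1p(gamma)


grid = [k / 1000 for k in range(1, 1001)]          # gamma in (0, 1]
lower_ok = all(remainder(g) >= 0.25 * g * g for g in grid)
upper_ok = all(remainder(g) <= 0.5 * g * g for g in grid)
print(lower_ok, upper_ok)
```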

Let $\nabla^{(n)} := \|\hat{\Sigma}^n - (\Theta^*)^{-1}\|_\infty$. By the first order condition
on the optimality of $\Theta^*$, it is clear that $\lim_{n \to \infty} \nabla^{(n)} = 0$.
The following lemma provides an upper bound on $\nabla^{(n)}$.

Lemma 3. Given the sample complexity $n > K \log(p)$
for some constant $K$, we have

$$\nabla^{(n)} \le c \sqrt{\frac{\log(p)}{n}},$$

with probability at least $1 - c_1 \exp(-c_2 n)$ for some positive
constants $c$, $c_1$ and $c_2$.

Proof. The proof follows from Lemma 1 in [10]. 

□ 
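
To get a feel for the rate in Lemma 3, the sketch below simulates the max-entry deviation of the sample covariance for the special case $\Sigma^* = I$ and prints it next to $\sqrt{\log(p)/n}$. This is an illustration under assumed settings (identity covariance, $p = 5$, fixed seed); it says nothing about the constant $c$.

```python
import math
import random


def max_cov_error(n, p, rng):
    """Max-entry error of the sample covariance for X ~ N(0, I_p)."""
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for i in range(p):
            for j in range(p):
                acc[i][j] += x[i] * x[j]
    return max(abs(acc[i][j] / n - (1.0 if i == j else 0.0))
               for i in range(p) for j in range(p))


rng = random.Random(0)
p = 5
for n in (100, 1000, 10000):
    err = max_cov_error(n, p, rng)
    # Columns: sample size, observed deviation, reference rate sqrt(log(p)/n).
    print(n, round(err, 4), round(math.sqrt(math.log(p) / n), 4))
```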

This entails that the restricted strong convexity and
smoothness (i.e., the required assumptions of the general
result in [7]) are satisfied. Now, we can specialize
the results in [7] to obtain the following theorem:

Theorem 1 (Global Greedy Sparsistency). Under the
assumption above, suppose we run Algorithm 1 with
stopping threshold $\epsilon_S > (2c\eta/\rho^2)\, d \log(p)/n$, where $d$
is the maximum node degree in the graphical model,
and the true parameters $\Theta^*$ satisfy
$\min_{(i,j) \in E^*} |\Theta^*_{ij}| > \sqrt{8\epsilon_S/\ldots}$,
and further that the number of samples scales
as

$$n > K d \log(p)$$

for some constant $K$. Then, with probability at least
$1 - c_1 \exp(-c_2 n)$, we have

(a) No False Exclusions: $E^* - \hat{E} = \emptyset$.

(b) No False Inclusions: $\hat{E} - E^* = \emptyset$.

4 Greedy Neighborhood Estimation 

Denote by $\mathcal{N}^*(r)$ the set of neighbors of a vertex $r \in V$,
so that $\mathcal{N}^*(r) = \{t : (r, t) \in E^*\}$. Then the graphical
model selection problem is equivalent to that of
estimating the neighborhoods $\hat{\mathcal{N}}_n(r) \subset V$ so that
$\mathbb{P}[\hat{\mathcal{N}}_n(r) = \mathcal{N}^*(r); \forall r \in V] \to 1$ as $n \to \infty$.

For any pair of random variables $X_r$ and $X_t$, the parameter
$\Theta_{rt}$ fully characterizes whether there is an
edge between them, and can be estimated via its
conditional likelihood. In particular, defining
$\Theta_r := \{\Theta_{rt}\}_{t \neq r}$, our goal is to use the conditional likelihood
of $X_r$ conditioned on $X_{V \setminus r}$ to estimate the support of
$\Theta_r$ and hence its neighborhood $\mathcal{N}(r)$. This conditional
distribution of $X_r$ conditioned on $X_{V \setminus r}$ generated by
(1) is given by (considering $\Theta^{-1} = \Sigma$)

$$X_r \mid X_{V \setminus r} \sim \mathcal{N}\left( \Sigma_{r \setminus r} \Sigma_{\setminus r \setminus r}^{-1} X_{V \setminus r}, \ \Sigma_{rr} - \Sigma_{r \setminus r} \Sigma_{\setminus r \setminus r}^{-1} \Sigma_{\setminus r r} \right).$$

However, note that we do not need to estimate the
variance of this conditional distribution in order to
obtain the support of $\Theta_r = \Theta_{r \setminus r}$. In particular, the
solution to the following least squares loss

$$\Gamma_r^* := \arg\min_{\Gamma_r} \mathbb{E}\left[ \left( X_r - \sum_{t \neq r} \Gamma_{rt} X_t \right)^2 \right],$$

would satisfy $\mathrm{supp}(\Gamma_r^*) = \mathrm{supp}(\Theta_r^*)$.
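
The support claim rests on the standard Gaussian identity that the population regression coefficients of $X_r$ on the remaining variables equal $-\Theta_{r \setminus r}/\Theta_{rr}$. The sketch below checks this numerically with NumPy on a hypothetical 4-node chain precision matrix; the function name `population_regression` is an assumption of this example.

```python
import numpy as np

# Hypothetical precision matrix of a chain 0 - 1 - 2 - 3.
theta = np.array([[2.0, 0.5, 0.0, 0.0],
                  [0.5, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.5],
                  [0.0, 0.0, 0.5, 2.0]])
sigma = np.linalg.inv(theta)


def population_regression(sigma, r):
    """Minimizer of E[(X_r - sum_t Gamma_t X_t)^2] over t != r, from the covariance."""
    rest = [t for t in range(sigma.shape[0]) if t != r]
    gamma = np.linalg.solve(sigma[np.ix_(rest, rest)], sigma[rest, r])
    return rest, gamma


rest, gamma = population_regression(sigma, r=1)
expected = -theta[rest, 1] / theta[1, 1]   # -Theta_{r, rest} / Theta_{rr}
print(rest, gamma, expected)
```

For node 1 the coefficient on node 3 vanishes, matching the missing edge (1, 3) in the chain.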

Given the $n$ samples $X^{(1)}, \ldots, X^{(n)}$, we thus use the
sample-based linear loss

$$\mathcal{L}(\Gamma_r) = \frac{1}{2n} \sum_{i=1}^{n} \left( X_r^{(i)} - \sum_{t \neq r} \Gamma_{rt} X_t^{(i)} \right)^2. \qquad (7)$$

Adapting the greedy algorithm from the previous section
to this linear loss at each node thus yields
Algorithm 2.
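
The node-wise procedure can be sketched as follows. This is a simplified illustration, assuming NumPy and the loss in Eq. (7): it uses a single forward gain per round as the backward reference and adds a round cap as a safety measure, neither of which is taken from the paper. The names `greedy_neighborhood`, `ls_loss`, `refit`, and the simulated chain data are hypothetical.

```python
import numpy as np


def ls_loss(X, r, support, coef):
    """Sample loss (1/2n) * sum_i (X_r - sum_t Gamma_t X_t)^2 on a given support."""
    resid = X[:, r] - X[:, support] @ coef
    return 0.5 * np.mean(resid ** 2)


def refit(X, r, support):
    """Least squares fit of X_r on the columns in `support`."""
    coef, *_ = np.linalg.lstsq(X[:, support], X[:, r], rcond=None)
    return coef


def greedy_neighborhood(X, r, eps, nu=0.5):
    n, p = X.shape
    support, coef = [], np.zeros(0)
    for _ in range(10 * p):                      # safety cap (assumption of this sketch)
        # Forward step: largest single-coordinate decrease of the loss.
        resid = X[:, r] - X[:, support] @ coef
        best_t, best_gain = None, 0.0
        for t in range(p):
            if t == r or t in support:
                continue
            gain = 0.5 * (X[:, t] @ resid) ** 2 / (n * (X[:, t] @ X[:, t]))
            if gain > best_gain:
                best_t, best_gain = t, gain
        if best_t is None or best_gain <= eps:
            break
        support = support + [best_t]
        coef = refit(X, r, support)
        # Backward step: drop coordinates whose removal costs at most nu * gain.
        while len(support) > 1:
            base = ls_loss(X, r, support, coef)
            costs = []
            for idx in range(len(support)):
                trimmed = coef.copy()
                trimmed[idx] = 0.0
                costs.append(ls_loss(X, r, support, trimmed) - base)
            idx = int(np.argmin(costs))
            if costs[idx] > nu * best_gain:
                break
            support = support[:idx] + support[idx + 1:]
            coef = refit(X, r, support)
    return sorted(support)


# Simulated chain 0 - 1 - 2 - 3 (an AR(1) sequence), for illustration only.
rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 4))
X = np.empty((n, 4))
X[:, 0] = z[:, 0]
for t in range(1, 4):
    X[:, t] = 0.8 * X[:, t - 1] + 0.6 * z[:, t]
neighbors_of_1 = greedy_neighborhood(X, r=1, eps=0.005)
print(neighbors_of_1)
```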



Assumption:

Let $\rho > 1$ be a constant and $\Delta \in \mathbb{R}^{p-1}$ be an arbitrary
$\eta d$-sparse vector, where $\eta > 2 + 4\rho^2 \left( \sqrt{(\rho^2 - \rho)/d} + \sqrt{2} \right)^2$.
We require the marginal population Fisher information
matrix $\Sigma^*_{\setminus r} = \mathbb{E}[X_{\setminus r} X_{\setminus r}^T]$ satisfy the
restricted eigenvalue property, i.e., for some positive
constant $C_{\min}$, we have

$$C_{\min} \|\Delta\|_F \le \|\Sigma^*_{\setminus r} \Delta\|_F \le \rho C_{\min} \|\Delta\|_F.$$

Lemma 4. Under the assumption above, and for some
arbitrarily small constant $a > 0$, the marginal sample
Fisher information matrix $\hat{\Sigma}^n_{\setminus r} = \frac{1}{n} \sum_{i=1}^{n} X^{(i)}_{\setminus r} (X^{(i)}_{\setminus r})^T$,
with probability at least $1 - c_1 \exp(-c_2 n)$, satisfies the
condition that for any $\eta d$-sparse vector $\Delta$,

$$(1 - a) C_{\min} \|\Delta\|_F \le \|\hat{\Sigma}^n_{\setminus r} \Delta\|_F \le (1 + a) \rho C_{\min} \|\Delta\|_F,$$

provided that $n > K d \log(p)$ for some positive constants
$K$, $c_1$ and $c_2$.



Proof The proof follows from Lemma 9 (Appendix K) 
in [11]. 

□ 



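Let $\nabla_r^{(n)} := \max_{t \neq r} \left| \frac{1}{n} \sum_{i=1}^{n} X_t^{(i)} \left( X_r^{(i)} - \sum_{s \neq r} \Gamma^*_{rs} X_s^{(i)} \right) \right|$.
By the first order condition on the optimality of $\Gamma_r^*$, it
is clear that $\lim_{n \to \infty} \nabla_r^{(n)} = 0$. The following lemma
provides an upper bound on $\nabla_r^{(n)}$.
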

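Algorithm 2 Greedy forward-backward algorithm for
marginal Gaussian covariance estimation

Input: Data Vectors $X^{(1)}, \ldots, X^{(n)}$, Stopping Threshold $\epsilon_S$, Backward Step Factor $\nu \in (0, 1)$
Output: Marginal Vector $\hat{\Gamma}_r$

Initialize $\hat{\Gamma}_r^{(0)} \leftarrow 0$, $\hat{S}^{(0)} \leftarrow \emptyset$, and $k \leftarrow 1$
while true do {Forward Step}
    $(t_*, \alpha_*) \leftarrow \arg\min_{t \in (\hat{S}^{(k-1)})^c;\ \alpha} \mathcal{L}(\hat{\Gamma}_r^{(k-1)} + \alpha e_t)$
    $\hat{S}^{(k)} \leftarrow \hat{S}^{(k-1)} \cup \{t_*\}$
    $\delta_f^{(k)} \leftarrow \mathcal{L}(\hat{\Gamma}_r^{(k-1)}) - \mathcal{L}(\hat{\Gamma}_r^{(k-1)} + \alpha_* e_{t_*})$
    if $\delta_f^{(k)} < \epsilon_S$ then
        break
    end if
    $\hat{\Gamma}_r^{(k)} \leftarrow \arg\min_{\Gamma_r} \mathcal{L}((\Gamma_r)_{\hat{S}^{(k)}})$
    $k \leftarrow k + 1$
    while true do {Backward Step}
        $t_* \leftarrow \arg\min_{t \in \hat{S}^{(k-1)}} \mathcal{L}(\hat{\Gamma}_r^{(k-1)} - \hat{\Gamma}_{rt}^{(k-1)} e_t)$
        if $\mathcal{L}(\hat{\Gamma}_r^{(k-1)} - \hat{\Gamma}_{rt_*}^{(k-1)} e_{t_*}) - \mathcal{L}(\hat{\Gamma}_r^{(k-1)}) > \nu \delta_f^{(k)}$ then
            break
        end if
        $\hat{S}^{(k-1)} \leftarrow \hat{S}^{(k-1)} - \{t_*\}$
        $\hat{\Gamma}_r^{(k-1)} \leftarrow \arg\min_{\Gamma_r} \mathcal{L}((\Gamma_r)_{\hat{S}^{(k-1)}})$
        $k \leftarrow k - 1$
    end while
end while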



Lemma 5. Given the sample complexity $n > K \log(p)$
for some constant $K$, we have

$$\nabla_r^{(n)} \le c \sqrt{\frac{\log(p)}{n}},$$

with probability at least $1 - c_1 \exp(-c_2 n)$ for some positive
constants $c$, $c_1$ and $c_2$.



Proof The proof follows from Lemma 5 in 



□ 



This entails that the restricted strong convexity and
smoothness (i.e., the required assumptions of the general
result in [7]) are satisfied with constants $C_{\min}$ and
$\rho C_{\min}$, respectively, because the third and higher order
derivatives are zero. Now, we can specialize
the results in [7] to obtain the following theorem:

Theorem 2 (Neighborhood Greedy Sparsistency).
Under the assumption above, suppose
we run Algorithm 2 with stopping threshold
$\epsilon_S > (8c\rho\eta/C_{\min})\, d \log(p)/n$, where $d$ is the maximum
node degree in the graphical model, and the true parameters
$\Gamma_r^*$ satisfy $\min_{t \in \mathcal{N}^*(r)} |\Gamma^*_{rt}| > \sqrt{32\rho\epsilon_S/C_{\min}}$,
and further that the number of samples scales as

$$n > K d \log(p)$$

for some constant $K$. Then, with probability at least
$1 - c_1 \exp(-c_2 n)$, we have

(a) No False Exclusions: $E^* - \hat{E} = \emptyset$.

(b) No False Inclusions: $\hat{E} - E^* = \emptyset$.

5 Comparisons to Related Methods 

In this section, we compare our global and local greedy
methods to the $\ell_1$-regularized Gaussian MLE, analyzed
in [10], and to $\ell_1$-regularization (Lasso) based
neighborhood selection, analyzed in [8, 14].

5.1 Sample Complexity 

Our greedy algorithm requires $O(d \log(p))$ samples to
recover the exact structure of the graph for both the
global and local neighborhood based methods. In
contrast, the $\ell_1$-regularized Gaussian MLE [10]
requires $O(d^2 \log(p))$ samples to guarantee structure
recovery with high probability. The linear neighborhood
selection with $\ell_1$-regularization [8] requires
$O(d \log(p))$ samples to guarantee sparsistency, similar
to our greedy algorithms.

5.2 Minimum Non-Zero Values 

The $\ell_1$-regularized Gaussian MLE imposes the model
condition that the minimum non-zero entry of $(\Sigma^*)^{-1}$
satisfy $\Theta^*_{\min} = O(1/d)$. Our greedy algorithms allow
for a broader range of minimum non-zero values,
$\Theta^*_{\min} = O(1/\sqrt{d})$. The linear neighborhood selection
with $\ell_1$-regularization again matches our greedy algorithms
and only requires that $\Theta^*_{\min} = O(1/\sqrt{d})$.

5.3 Parameter Restrictions 

We now compare the irrepresentable and restricted 
eigenvalue and smoothness conditions imposed on the 
model parameters by the different methods. 

5.3.1 Star Graphs 

Consider a star graph $\mathcal{G}(V, E)$ with $p$ nodes as in Fig 1(a),
where the center node is labeled 1 and the other nodes
are labeled from 2 to $p$. Following [10], consider the
following covariance matrix $\Sigma^*$ parameterized by the
correlation parameter $r \in [-1, 1]$: the diagonal entries
are set to $\Sigma^*_{ii} = 1$ for all $i \in V$; the entries corresponding
to edges are set to $\Sigma^*_{ij} = r$ for $(i, j) \in E$; while the
non-edge entries are set as $\Sigma^*_{ij} = r^2$ for $(i, j) \notin E$. It is
easy to check that $\Sigma^*$ induces the desired star graph.
With this setup, the irrepresentable condition imposed
by the $\ell_1$-regularized Gaussian MLE [10] entails that
$|r|(|r| + 2) < 1$, or equivalently $r \in (-0.4142, 0.4142)$,
to guarantee sparsistency. However, our greedy algorithms
allow for $r \in (-1, 1)$ (since $C_{\min} = 1 - r^2$).
Under the same setup, the linear neighborhood selection
with $\ell_1$-regularization [8] requires $r \in (-1, 1)$ to
guarantee success.

Figure 1: Generic Graph Schematics. Panels: (a) Star, (b) Chain, (c) Grid, (d) Diamond.
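
The star construction can be checked by hand or with a few lines of code. The sketch below assumes the leaf-leaf covariance is $r^2$ (the value that makes leaves conditionally independent given the center) and verifies that the Schur complement with respect to the center has zero off-diagonal entries, which means no leaf-leaf edges. It also evaluates the positive root of $r^2 + 2r - 1 = 0$. The function names are hypothetical; node 0 plays the role of the center.

```python
import math


def star_covariance(p, r):
    """Unit diagonal, r on center-leaf pairs, r^2 on leaf-leaf pairs (assumed value)."""
    sigma = [[1.0 if i == j else r * r for j in range(p)] for i in range(p)]
    for j in range(1, p):
        sigma[0][j] = sigma[j][0] = r
    return sigma


def partial_cov_given_center(sigma, i, j):
    """Cov(X_i, X_j | X_center) via the Schur complement with respect to node 0."""
    return sigma[i][j] - sigma[i][0] * sigma[0][j] / sigma[0][0]


sigma = star_covariance(5, 0.6)
# If every leaf pair has zero conditional covariance given the center, the leaves
# are independent given the center, so the precision matrix has no leaf-leaf entries.
leaf_leaf = [partial_cov_given_center(sigma, i, j)
             for i in range(1, 5) for j in range(1, 5) if i != j]
irrepresentable_bound = math.sqrt(2.0) - 1.0   # positive root of r^2 + 2r - 1 = 0
print(max(abs(v) for v in leaf_leaf), irrepresentable_bound)
```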

5.3.2 Chain Graphs 

Consider a chain (line) graph $\mathcal{G}(V, E)$ on $p$ nodes as
shown in Fig 1(b). Again, consider a population covariance
matrix $\Sigma^*$ parameterized by the correlation
parameter $r \in [-1, 1]$: set $\Sigma^*_{ij} = r^{|i - j|}$. Thus, this
matrix assumes a correlation factor of $r^k$ between two
nodes that are $k$ hops away from each other. It is
easy to check that $\Sigma^*$ induces the desired chain graph.
With this setup, the $\ell_1$-regularized Gaussian MLE [10]
requires $|r|^{p-1} \left( (p - 2)|r| + p - 1 \right) < 1$. It is hard to
evaluate bounds on $r$ in general, but for the case $p = 4$
we have $r \in (-0.6, 0.6)$; for the case $p = 10$ we have
$r \in (-0.75, 0.75)$; and for the case $p = 100$ we have
$r \in (-0.95, 0.95)$. Our greedy algorithms on the other
hand allow for $r \in (-1, 1)$ (since $C_{\min} = (1 - r^2) f_p(r)$
for some function $f_p(r)$ that depends on $p$ and satisfies
$f_p(r) > c_p$ for all $r$ and some constant $c_p$ depending
only on $p$). Under the same setup, the linear neighborhood
selection with $\ell_1$-regularization [8] only imposes
$r \in (-1, 1)$, similar to our greedy methods.
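
That the covariance $r^{|i-j|}$ induces a chain can be verified directly: its inverse is tridiagonal. The sketch below writes down the standard tridiagonal inverse of this AR(1)-type matrix as a candidate and checks that the product with the covariance is the identity. The function names, and the choice $p = 6$, $r = 0.7$, are assumptions of this example.

```python
def chain_covariance(p, r):
    """Sigma_ij = r^{|i-j|}."""
    return [[r ** abs(i - j) for j in range(p)] for i in range(p)]


def chain_precision(p, r):
    """Candidate tridiagonal inverse of chain_covariance (standard AR(1) form)."""
    scale = 1.0 / (1.0 - r * r)
    theta = [[0.0] * p for _ in range(p)]
    for i in range(p):
        theta[i][i] = scale * (1.0 if i in (0, p - 1) else 1.0 + r * r)
        if i + 1 < p:
            theta[i][i + 1] = theta[i + 1][i] = -scale * r
    return theta


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


p, r = 6, 0.7
product = matmul(chain_covariance(p, r), chain_precision(p, r))
max_dev = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(p) for j in range(p))
print(max_dev)
```

Since the only non-zero off-diagonal entries of the inverse sit next to the diagonal, the edges are exactly the consecutive pairs.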

5.3.3 Diamond Graph 

Consider the diamond graph $\mathcal{G}(V, E)$ on 4 nodes with
the nodes labeled as in Fig 1(d). Given a correlation
parameter $r \neq 0$, let $\Sigma^*$ be the population covariance
matrix with $\Sigma^*_{ii} = 1$ and $\Sigma^*_{ij} = r$ except $\Sigma^*_{23} = 0$
and $\Sigma^*_{14} = 2r^2$. It is easy to check that $(\Sigma^*)^{-1}$ induces
the desired graph. With this setup, the $\ell_1$-regularized
Gaussian MLE [10] requires $4|r|(|r| + 1) < 1$, or equivalently
$r \in (-0.2071, 0.2071)$. Our greedy algorithm
allows for $r \in (-0.7071, 0.7071)$ (since $C_{\min} = 1 - 2r^2$).
Under the same setup, the linear neighborhood selection
with $\ell_1$-regularization [8] requires $2|r| < 1$,
or equivalently $r \in (-0.5, 0.5)$, to guarantee
success. Unlike the previous two examples, this is a
strictly stronger condition than that imposed by our
greedy methods.
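
The three ranges for the diamond graph follow from elementary algebra, and the missing edge can be confirmed by a conditional covariance computation. The sketch below assumes the covariance has $\Sigma^*_{23} = 0$ and $\Sigma^*_{14} = 2r^2$ (indices are shifted to start at 0 in the code); the function and variable names are hypothetical.

```python
import math


def diamond_covariance(r):
    """4-node covariance: unit diagonal, r elsewhere, except Sigma_23 = 0, Sigma_14 = 2 r^2."""
    return [[1.0, r, r, 2 * r * r],
            [r, 1.0, 0.0, r],
            [r, 0.0, 1.0, r],
            [2 * r * r, r, r, 1.0]]


def partial_cov_14(sigma):
    """Cov(X_1, X_4 | X_2, X_3); uses that the (2,3) block is the identity here."""
    return sigma[0][3] - (sigma[0][1] * sigma[1][3] + sigma[0][2] * sigma[2][3])


# A zero conditional covariance given all other nodes means no edge between 1 and 4.
print(partial_cov_14(diamond_covariance(0.3)))

lasso_mle_bound = (math.sqrt(2.0) - 1.0) / 2.0   # positive root of 4 r^2 + 4 r - 1 = 0
greedy_bound = 1.0 / math.sqrt(2.0)              # where 1 - 2 r^2 reaches zero
neighborhood_lasso_bound = 0.5                   # from 2|r| < 1
print(lasso_mle_bound, neighborhood_lasso_bound, greedy_bound)
```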

6 Experimental Analysis 

In this section we outline our experimental results
in testing the effectiveness of both Algorithms 1 and 2
in a simulated environment.

6.1 Optimization Method 

Our greedy algorithm consists of a single variable opti- 
mization step where we try to pick the best coordinate. 
This step can be run in parallel for all single vari- 
ables to achieve maximum speedup. For greedy neigh- 
borhood selection, the single variable optimization is 
a relatively simple operation, however for the global 
model selection algorithm (log-det optimization), we 
would like to provide a fast single variable optimiza- 
tion method to avoid a continual log-det calculation. 
Following the result in [12], we have 

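$$\det\left(\hat{\Theta}^{(k-1)} + \alpha(e_{ij} + e_{ji})\right) = \det\left(\hat{\Theta}^{(k-1)}\right) \left( \left(1 + \alpha (\hat{\Theta}^{(k-1)})^{-1}_{ij}\right)^2 - \alpha^2 (\hat{\Theta}^{(k-1)})^{-1}_{ii} (\hat{\Theta}^{(k-1)})^{-1}_{jj} \right).$$

This entails that the minimizer

$$\alpha^* = \arg\min_{\alpha} \ \langle\langle \hat{\Theta}^{(k-1)} + \alpha(e_{ij} + e_{ji}), \hat{\Sigma}^n \rangle\rangle - \log\det\left(\hat{\Theta}^{(k-1)} + \alpha(e_{ij} + e_{ji})\right)$$

can be written in closed form in terms of $\hat{\Sigma}^n_{ij}$ and the entries
$(\hat{\Theta}^{(k-1)})^{-1}_{ii}$, $(\hat{\Theta}^{(k-1)})^{-1}_{jj}$ and $(\hat{\Theta}^{(k-1)})^{-1}_{ij}$.
This closed-form solution simplifies the single variable
optimization step in our algorithm and avoids continual
calculation of $\log\det(\Theta)$.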
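
One way to carry out the single variable step is sketched below, assuming NumPy. It is a derivation made for this example rather than the paper's stated formula: setting the derivative of $2\alpha\hat{\Sigma}_{ij} - \log\left((1 + \alpha W_{ij})^2 - \alpha^2 W_{ii} W_{jj}\right)$ to zero, with $W = \Theta^{-1}$, gives a quadratic in $\alpha$, and the root that keeps the determinant ratio positive is the minimizer. The names `rank_two_det_ratio` and `best_step` and the test matrices are hypothetical.

```python
import numpy as np


def rank_two_det_ratio(theta, i, j, alpha):
    """det(Theta + alpha (e_ij + e_ji)) / det(Theta) from entries of Theta^{-1}."""
    w = np.linalg.inv(theta)
    return (1.0 + alpha * w[i, j]) ** 2 - alpha ** 2 * w[i, i] * w[j, j]


def best_step(theta, sigma_hat, i, j):
    """Stationary point of alpha -> 2 alpha S_ij - log(det ratio) inside the PD region."""
    w = np.linalg.inv(theta)
    s, wij = sigma_hat[i, j], w[i, j]
    dd = wij ** 2 - w[i, i] * w[j, j]          # negative for positive definite Theta
    if abs(s) < 1e-14:
        return -wij / dd                       # the quadratic degenerates to a linear equation
    disc = np.sqrt(dd ** 2 + 4.0 * s ** 2 * w[i, i] * w[j, j])
    roots = [(dd - 2.0 * s * wij + sign * disc) / (2.0 * s * dd) for sign in (1.0, -1.0)]
    # Exactly one root keeps the updated matrix positive definite.
    return next(a for a in roots if rank_two_det_ratio(theta, i, j, a) > 0.0)


theta = np.array([[2.0, 0.3, 0.0], [0.3, 2.0, 0.1], [0.0, 0.1, 2.0]])
sigma_hat = np.array([[1.0, 0.4, 0.2], [0.4, 1.0, 0.3], [0.2, 0.3, 1.0]])
alpha_star = best_step(theta, sigma_hat, 0, 2)
print(alpha_star)
```

The matrix inverse is computed once per call here for clarity; in a forward step over many candidate pairs one would reuse a single inverse.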




[Figure 2 panels: (a) Chain (Line Graph), (b) 4-Nearest Neighbor (Grid Graph); horizontal axis: Control Parameter]

Fig 2: Plots of success probability $\mathbb{P}[\hat{S} = S^*]$ versus the
control parameter $\beta(n, p, d) = n/[70 d \log(p)]$ for (a)
chain ($d = 2$) and (b) 4-nearest neighbor grid ($d = 4$)
using both Algorithm 1 and $\ell_1$-regularized Gaussian
MLE (Graphical Lasso). As our theorem suggests and
these figures show, the Global Greedy algorithm requires
fewer samples to recover the exact structure of
the graphical model.



6.2 Experiments 

To present a formal experimental analysis for both
Algorithm 1 and Algorithm 2, we simulated zero-mean
Gaussian inverse covariance estimation, or GMRF
structure learning, for various graph types and scalings
of $(n, p, d)$. For the Global Greedy method we
experimented using chain ($d = 2$) and grid ($d = 4$)
graph types with sizes of $p \in \{36, 64, 100\}$. For
the Neighborhood Greedy method we experimented
using chain ($d = 2$) and star ($d = 0.1p$) graph types
with sizes of $p \in \{36, 64, 100\}$. Figure 1 outlines the
schematic structure for each graph type. For each
algorithm, we measured performance by whether it completely
learned the true support set $S^*$ pertaining to the
non-zero inverse covariates (graph edges). If $S^*$ was
completely learned then we called this a success and
otherwise we called it a failure. Using a batch size
of 50 trials for each scaling of $(n, p, d)$ we measured
the probability of success as the average success rate.
For both algorithms we used a stopping threshold
$\epsilon_S = c\, d \log(p)/n$, where $d$ is the maximum degree of the
graph, $p$ is the number of nodes in the graph, $n$ is the
number of samples used, and $c$ is a constant tuning
parameter, as well as a backward step threshold
of $\nu = 0.5$. We compared Algorithm 1 to the
$\ell_1$-regularized Gaussian MLE (Graphical Lasso) as
discussed in [5] and [10] using the glasso implementation
from Friedman et al. [5]. We compared Algorithm
2 to neighborhood based $\ell_1$-regularized linear
regression (Neighborhood Lasso) using the glmnet
generalized Lasso implementation, also from Friedman
et al. [6]. Both glasso and glmnet use a regularization
parameter $\lambda = c\sqrt{\log(p)/n}$, which was optimally set using
$k$-fold cross validation.

Fig 3: Plots of success probability
$\mathbb{P}[\hat{\mathcal{N}}(r) = \mathcal{N}^*(r), \forall r \in V]$ versus the control parameter
$\beta(n, p, d) = n/[70 d \log(p)]$ for (a) chain ($d = 2$)
and $\beta(n, p, d) = n/[200 \log(dp)]$ for (b) star graph
($d = 0.1p$) using both Algorithm 2 and nodewise
$\ell_1$-regularized linear regression (Neighborhood Lasso).
As our theorem suggests and these figures show, the
Neighborhood Greedy algorithm requires fewer samples
to recover the exact structure of the graphical model.
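
The bookkeeping behind such an experiment can be sketched as follows: the control parameter is inverted to obtain a sample size, and the success probability is the fraction of trials with exact support recovery. The helper names and the default scale of 70 (taken from the chain and grid control parameter) are assumptions of this example; the estimator itself is not included.

```python
import math


def samples_for_beta(beta, p, d, scale=70.0):
    """Invert beta = n / (scale * d * log p) to get an integer sample size."""
    return math.ceil(beta * scale * d * math.log(p))


def success_rate(true_support, estimates):
    """Fraction of trials whose estimated support equals the true support exactly."""
    target = frozenset(true_support)
    return sum(frozenset(e) == target for e in estimates) / len(estimates)


# Sample size for beta = 1 on a chain (d = 2) with p = 36 nodes.
print(samples_for_beta(1.0, p=36, d=2))

# Hypothetical outcomes of four trials against a single true edge.
trial_supports = [{(0, 1)}, {(0, 1), (1, 2)}, {(0, 1)}, set()]
print(success_rate({(0, 1)}, trial_supports))
```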

Figure 2 plots the probability of successfully learning
$S^*$ vs the control parameter $\beta(n, p, d) = n/[70 d \log(p)]$ for a
varying number of samples $n$ for both Algorithm 1
and Graphical Lasso. Figure 3 plots the probability
of successfully learning $S^*$ vs the control parameter
$\beta(n, p, d) = n/[70 d \log(p)]$ for the chain graph type and
$\beta(n, p, d) = n/[200 \log(dp)]$ for the star graph type for
both Algorithm 2 and neighborhood based $\ell_1$ linear
regression. Both figures illustrate our theoretical results
that the Greedy Algorithms require fewer samples
($O(d \log p)$) than the state of the art Lasso methods
($O(d^2 \log p)$) for complete structure learning.